Loading libraries
Setting the data location; make sure the files below are available so the rest of the notebook runs.
datadir <- "../data"
list.files(datadir)
[1] "detroit-311.csv" "detroit-blight-violations.csv"
[3] "detroit-crime.csv" "detroit-demolition-permits.tsv"
[5] "detroit-PARCELS.csv"
First, we load all the data.
Exploring the demolition permits data, data_permits:
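The loading chunk itself is not shown in this excerpt; a minimal sketch, assuming the readr package and the variable names used below (data_permits, data_violations, data_311, data_crime):

```r
library(readr)

# hypothetical load step; column types are re-cast explicitly further below
data_permits    <- read_tsv(file.path(datadir, "detroit-demolition-permits.tsv"))
data_violations <- read_csv(file.path(datadir, "detroit-blight-violations.csv"))
data_311        <- read_csv(file.path(datadir, "detroit-311.csv"))
data_crime      <- read_csv(file.path(datadir, "detroit-crime.csv"))
```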
head(data_permits)
First, converting data fields to the right types:
#converting to factors
cols <- c("CASE_TYPE", "CASE_DESCRIPTION", "LEGAL_USE", "BLD_PERMIT_TYPE",
"PERMIT_DESCRIPTION", "BLD_PERMIT_DESC", "BLD_TYPE_USE", "RESIDENTIAL",
"DESCRIPTION", "BLD_TYPE_CONST_COD", "BLD_ZONING_DIST", "BLD_USE_GROUP",
"BLD_BASEMENT", "FEE_TYPE", "CSF_CREATED_BY","CONDITION_FOR_APPROVAL")
data_permits %<>% mutate_each_(funs(factor(.)),cols)
#converting $$ to numeric
cols <- c("PCF_AMT_PD", "PCF_AMT_DUE", "PCF_UPDATED","ESTIMATED_COST")
data_permits %<>% mutate_each_(funs(from_currency(.)),cols)
#converting to dates
cols <-c("PERMIT_APPLIED","PERMIT_ISSUED","PERMIT_EXPIRES")
data_permits %<>% mutate_each_(funs(parse_date_time(.,orders="mdy",tz="America/Detroit")),cols)
3 failed to parse. 3 failed to parse. 3 failed to parse.
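from_currency() used above is a helper defined elsewhere in the notebook; a minimal base-R sketch, assuming values shaped like "$1,234.50":

```r
# hypothetical re-implementation: strip "$" and "," and coerce to numeric
from_currency <- function(x) {
  as.numeric(gsub("[$,]", "", as.character(x)))
}

from_currency(c("$1,234.50", "$0.00", "238"))  # -> 1234.5 0 238
```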
summary(data_permits)
PERMIT_NO PERMIT_APPLIED PERMIT_ISSUED
Length:7133 Min. :2010-05-20 00:00:00 Min. :2010-05-20 00:00:00
Class :character 1st Qu.:2012-08-15 00:00:00 1st Qu.:2012-08-15 00:00:00
Mode :character Median :2013-03-12 00:00:00 Median :2013-03-12 00:00:00
Mean :2013-07-07 20:39:12 Mean :2013-07-07 22:37:45
3rd Qu.:2014-10-15 00:00:00 3rd Qu.:2014-10-17 00:00:00
Max. :2015-08-28 00:00:00 Max. :2015-08-28 00:00:00
NA's :3 NA's :3
PERMIT_EXPIRES SITE_ADDRESS BETWEEN1 PARCEL_NO
Min. :2010-11-20 00:00:00 Length:7133 Length:7133 Length:7133
1st Qu.:2013-02-23 00:00:00 Class :character Class :character Class :character
Median :2013-09-08 00:00:00 Mode :character Mode :character Mode :character
Mean :2013-12-03 12:13:46
3rd Qu.:2015-02-28 00:00:00
Max. :2016-12-05 00:00:00
NA's :183
LOT_NUMBER SUBDIVISION CASE_TYPE CASE_DESCRIPTION
Length:7133 Length:7133 BLD:7133 Building Permit:7133
Class :character Class :character
Mode :character Mode :character
LEGAL_USE ESTIMATED_COST PARCEL_SIZE PARCEL_CLUSTER_SECTOR
ONE FAMILY DWELLING :2709 Min. : 0 Min. : 0 Min. : 1.000
RES : 920 1st Qu.: 4525 1st Qu.: 3354 1st Qu.: 2.000
ONE FAMILY : 798 Median : 13470 Median : 4095 Median : 5.000
TWO FAMILY DWELLING : 479 Mean : 18955 Mean : 13540 Mean : 4.656
SINGLE FAMILY DWELLING: 358 3rd Qu.: 18750 3rd Qu.: 4932 3rd Qu.: 7.000
(Other) :1425 Max. :448200 Max. :2554576 Max. :10.000
NA's : 444 NA's :6791 NA's :45 NA's :11
STORIES PARCEL_FLOOR_AREA PARCEL_GROUND_AREA PRC_AKA_ADDRESS BLD_PERMIT_TYPE
Min. : 1.000 Min. : 0 Min. : 0.0 Length:7133 DISM :5859
1st Qu.: 1.000 1st Qu.: 0 1st Qu.: 630.0 Class :character Dismantle:1274
Median : 1.500 Median : 0 Median : 768.0 Mode :character
Mean : 2.126 Mean : 6830 Mean : 742.3
3rd Qu.: 2.000 3rd Qu.: 0 3rd Qu.: 912.0
Max. :26.000 Max. :5102782 Max. :27316.0
NA's :1426 NA's :45 NA's :45
PERMIT_DESCRIPTION BLD_PERMIT_DESC BLD_TYPE_USE
Dismantle:5859 WRECK AND REMOVE DEBRIS:3077 28 :4265
NA's :1274 WRECK & REMOVE DEBRIS :2060 One Family Dwelling:1121
WRECK & REMOVE DEBRIS. : 469 54 : 944
Wrecking Permit : 292 6 : 260
WRECK AND REMOVE : 31 26 : 92
(Other) : 213 (Other) : 399
NA's : 991 NA's : 52
RESIDENTIAL DESCRIPTION BLD_TYPE_CONST_COD BLD_ZONING_DIST
NON-RESIDENTIAL: 659 One Family Dwelling:4265 5B :6456 R1 :1350
RESIDENTIAL :6474 Two Family Dwelling: 944 3B : 323 R2 : 790
Commercial Building: 260 2B : 61 B4 : 96
Multiple Dwelling : 92 1B : 35 R3 : 44
School : 42 5A : 23 M4 : 29
(Other) : 256 (Other): 34 (Other): 104
NA's :1274 NA's : 201 NA's :4720
BLD_USE_GROUP BLD_BASEMENT FEE_TYPE CSM_CASENO CSF_CREATED_BY SEQ_NO
R3 :6216 N :1051 WPMT :7037 Length:7133 L-FW :1636 Min. :1
M : 98 Y :5226 BCP : 27 Class :character RSA :1435 1st Qu.:1
R2 : 72 NA's: 856 BPM5 : 12 Mode :character GRE :1105 Median :1
B : 60 REIN : 9 H-AP : 641 Mean :1
E : 53 SPEC : 9 M-JD : 488 3rd Qu.:1
(Other): 189 WPM2 : 9 M-LR : 444 Max. :1
NA's : 445 (Other): 30 (Other):1384
PCF_AMT_PD PCF_AMT_DUE PCF_UPDATED OWNER_LAST_NAME
Min. : 0.0 Min. : 0.0 Min. : 1313 Length:7133
1st Qu.: 108.0 1st Qu.: 238.0 1st Qu.: 8615 Class :character
Median : 238.0 Median : 238.0 Median : 41211 Mode :character
Mean : 257.3 Mean : 274.1 Mean : 90674
3rd Qu.: 238.0 3rd Qu.: 238.0 3rd Qu.: 72085
Max. :20494.2 Max. :20494.2 Max. :101112414
NA's :1274 NA's :471
OWNER_FIRST_NAME OWNER_ADDRESS1 OWNER_ADDRESS2 OWNER_CITY
Length:7133 Length:7133 Length:7133 Length:7133
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
OWNER_STATE OWNER_ZIP CONTRACTOR_LAST_NAME CONTRACTOR_FIRST_NAME
Length:7133 Length:7133 Length:7133 Length:7133
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
CONTRACTOR_ADDRESS1 CONTRACTOR_ADDRESS2 CONTRACTOR_CITY CONTRACTOR_STATE
Length:7133 Length:7133 Length:7133 Length:7133
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
CONTRACTOR_ZIP
Min. :48009
1st Qu.:48204
Median :48209
Mean :48230
3rd Qu.:48227
Max. :92606
NA's :179
CONDITION_FOR_APPROVAL
Call for inspection upon compliance.Ord. 290-H, 12-11-14.3, ET SEQ.CC: OWNER: Borman LLC. 30600 Telegraph Rd. Bingham Farms, MI 48025CC: TENANT: LD Acquistion Co. P.O. Box 3429 El Segunda, CA 90245: 1
Remove all foundations from site and backfill.Ord. 290-H, 12-11-19.9, ET SEQ. : 1
NA's :7131
site_location owner_location contractor_location geom
Length:7133 Length:7133 Length:7133 Length:7133
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Extracting building lat/longs from the “site_location” variable
data_permits %<>%
#filter out permits that have no lat/long
filter(grepl("\\([0-9\\.\\-]+, *[0-9\\.\\-]+\\)",site_location)) %>%
#extracting lat longs
mutate(lat = as.double(sub(".*\\(([0-9\\.\\-]+),.*","\\1", site_location))) %>%
mutate(long = as.double(sub(".*, *([0-9\\.\\-]+).*","\\1", site_location))) %>%
mutate(address_only = sub("([^\\(]+)\\([0-9\\.\\-]+,.*","\\1", site_location))
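A quick check of the extraction regexes on a made-up site_location value (the real field follows the same "address (lat, long)" shape):

```r
# hypothetical example value
loc <- "4331 BARHAM, Detroit MI (42.394106, -82.9474)"

sub(".*\\(([0-9\\.\\-]+),.*", "\\1", loc)          # latitude  -> "42.394106"
sub(".*, *([0-9\\.\\-]+).*", "\\1", loc)           # longitude -> "-82.9474"
sub("([^\\(]+)\\([0-9\\.\\-]+,.*", "\\1", loc)     # address   -> "4331 BARHAM, Detroit MI "
```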
Create a list of buildings, removing entries whose per-address lat/long standard deviation is larger than 10e-4 degrees (i.e. 1e-3, roughly 110 m). Not much actually gets removed, but it keeps this step consistent with the ones below. Also removed the ~40 records whose address was missing.
bld_list_permit <- data_permits %>%
mutate(r = sqrt(PARCEL_SIZE/pi) ) %>%
select(address=address_only, PARCEL_NO, LOT_NUMBER, PERMIT_ISSUED, PARCEL_SIZE, lat, long, r) %>%
filter(! grepl("\\([0-9\\.\\-]+, *[0-9\\.\\-]+\\)",address)) %>%
arrange(address, desc(PERMIT_ISSUED)) %>%
group_by(address) %>%
mutate(sdlat=sd(lat), sdlong=sd(long)) %>%
filter((sdlat<10e-4 & sdlong<10e-4) | (is.na(sdlat) | is.na(sdlong))) %>%
filter(long > -83.3 & long < -82.8) %>%
filter(lat > 42.2 & lat < 42.5) %>%
arrange(PERMIT_ISSUED) %>%
summarise(n_permits=n(), last_permit=last(PERMIT_ISSUED),
lat=median(lat), long=median(long), r=last(r))
head(bld_list_permit)
Function returning the number of blighted records for a specific lat/long coordinate. It is used to assign blight by computing distances directly in degrees (0.0001° ≈ 11 m ≈ 37 ft), which is faster than using a built-in function such as distGeo.
in_blight <- function(lt, ln, data) {
data %>%
filter(sqrt((lat-lt)^2+(long-ln)^2)<rdegr+0.0001) %>%
nrow()
}
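The distance test that in_blight() applies, written out in base R on hypothetical data (with the lat/long/rdegr columns built below):

```r
lt <- 42.3941; ln <- -83.1005              # hypothetical query point
toy <- data.frame(lat   = c(42.3941, 42.4100),
                  long  = c(-83.1005, -83.2000),
                  rdegr = c(0.0003, 0.0003))
# Euclidean distance in degrees, against the parcel radius plus ~11 m slack
hit <- sqrt((toy$lat - lt)^2 + (toy$long - ln)^2) < toy$rdegr + 0.0001
sum(hit)  # 1 -- only the first record is in range
```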
#creating the dataframe of blighted buildings
indata <- bld_list_permit %>%
filter(!is.na(r)) %>%
unique() %>%
select(address, lat, long, r)
indata_degrees <- indata %>%
mutate(rdegr = 0.0001/37 * r)
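A sanity check of the 0.0001° ≈ 11 m ≈ 37 ft rule of thumb, from a spherical-Earth approximation (not part of the original notebook):

```r
R_earth <- 6371000                         # mean Earth radius in meters
m_per_deg <- pi * R_earth / 180            # meters per degree of latitude
m_per_deg * 1e-4                           # ~11.1 m
m_per_deg * 1e-4 / 0.3048                  # ~36.5 ft
# longitude degrees shrink with latitude; at Detroit (~42.4 N):
m_per_deg * cos(42.4 * pi / 180) * 1e-4    # ~8.2 m
```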
Loading blight violation incidents
head(data_violations)
Exploring the violations data, first converting fields to the right types
#converting to factors
cols <- c("AgencyName","ViolationCode","Disposition","PaymentStatus","Void",
"ViolationCategory","Country")
data_violations %<>% mutate_each_(funs(factor(.)),cols)
#converting $$ to numeric
cols <- c("FineAmt","AdminFee","LateFee","StateFee","CleanUpCost","JudgmentAmt")
data_violations %<>% mutate_each_(funs(from_currency(.)),cols)
#not converting to dates as the dates in the fields below have weird years
cols <-c("TicketIssuedDT","HearingDT")
#data_violations %<>% mutate_each_(funs(from_currency(.)),cols)
summary(data_violations)
TicketID TicketNumber AgencyName
Min. : 18645 Length:307804 Building and Safety Engineering Department:173311
1st Qu.:101806 Class :character Department of Public Works :112757
Median :183824 Mode :character Detroit Police Department : 12853
Mean :182967 Health Department : 8881
3rd Qu.:265211 Neighborhood City Halls : 2
Max. :339184
ViolName ViolationStreetNumber ViolationStreetName MailingStreetNumber
Length:307804 Min. : -11064 Length:307804 Min. : 0
Class :character 1st Qu.: 4936 Class :character 1st Qu.: 3107
Mode :character Median : 10624 Mode :character Median : 9359
Mean : 12001 Mean : 18527
3rd Qu.: 15895 3rd Qu.: 18251
Max. :222222222 Max. :222222222
NA's :808
MailingStreetName MailingCity MailingState MailingZipCode
Length:307804 Length:307804 Length:307804 Length:307804
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
NonUsAddressCode Country TicketIssuedDT TicketIssuedTime
Length:307804 US : 18062 Length:307804 Length:307804
Class :character United Kingdom: 97 Class :character Class1:hms
Mode :character Singapore : 96 Mode :character Class2:difftime
Canada : 71 Mode :numeric
Australia : 22
(Other) : 136
NA's :289320
HearingDT CourtTime ViolationCode ViolDescription
Length:307804 Length:307804 9-1-36(a) :106521 Length:307804
Class :character Class1:hms 9-1-81(a) : 45036 Class :character
Mode :character Class2:difftime 9-1-104 : 38612 Mode :character
Mode :numeric 22-2-88 : 28730
22-2-88(b): 21804
9-1-110(a): 6779
(Other) : 60322
Disposition FineAmt AdminFee LateFee
Responsible By Default :172128 Min. : 0.0 Min. :20 Min. : 0.0
Not responsible By Dismissal : 53543 1st Qu.: 125.0 1st Qu.:20 1st Qu.: 10.0
Not responsible By City Dismissal: 44380 Median : 250.0 Median :20 Median : 25.0
Responsible By Admission : 16432 Mean : 360.2 Mean :20 Mean : 35.8
Responsible By Determination : 10233 3rd Qu.: 250.0 3rd Qu.:20 3rd Qu.: 25.0
Not responsible By Determination : 7809 Max. :10000.0 Max. :20 Max. :1000.0
(Other) : 3279 NA's :1973
StateFee CleanUpCost JudgmentAmt PaymentStatus
Min. :10 Min. : 0.000 Min. : 0.0 NO PAYMENT APPLIED :244303
1st Qu.:10 1st Qu.: 0.000 1st Qu.: 140.0 NO PAYMENT ON RECORD: 14565
Median :10 Median : 0.000 Median : 305.0 PAID IN FULL : 44319
Mean :10 Mean : 0.515 Mean : 425.2 PARTIAL PAYMENT MADE: 4617
3rd Qu.:10 3rd Qu.: 0.000 3rd Qu.: 305.0
Max. :10 Max. :13123.800 Max. :11030.0
NA's :1972
Void ViolationCategory ViolationAddress MailingAddress
0 : 99133 0:305787 Length:307804 Length:307804
NA's:208671 1: 2017 Class :character Class :character
Mode :character Mode :character
Getting the violation codes; we’ll manually categorize them below
violCodes <- data_violations %>%
select(ViolationCode, ViolDescription) %>%
unique()
Generated lat/longs and addresses, and cleaned the data by keeping records with Detroit addresses only. Didn’t filter by country, as that removed most of the entries (from 300k down to 13k records). Didn’t remove dispositions of “not responsible” or “pending”, as these could carry information.
viol_list <- data_violations %>%
# filter(Country == "US") %>%
filter(grepl("\\([0-9\\.\\-]+, *[0-9\\.\\-]+\\)",ViolationAddress)) %>%
#extracting lat longs
mutate(lat = as.double(sub(".*\\(([0-9\\.\\-]+),.*","\\1", ViolationAddress))) %>%
mutate(long = as.double(sub(".*, *([0-9\\.\\-]+).*","\\1", ViolationAddress))) %>%
mutate(address_only = sub("([^\\(]+)\\([0-9\\.\\-]+,.*","\\1", ViolationAddress)) %>%
filter(grepl("Detroit",ViolationAddress)) %>%
# filter(! grepl("Not responsible",Disposition)) %>%
# filter(! grepl("PENDING", Disposition)) %>%
select(lat, long, ViolationCode, Disposition, JudgmentAmt, PaymentStatus, ViolationCategory, address_only)
head(viol_list)
What happens with Disposition? Well, it seems that it could be categorized as responsible, not responsible, and pending
viol_list %>%
select(Disposition) %>%
group_by(Disposition) %>%
summarise(n())
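This three-way split is collapsed later (when building bld_list_viol) with a prefix match; a toy illustration with made-up values:

```r
# toy Disposition values (the real data has more levels)
d <- c("Responsible By Default", "Not responsible By Dismissal", "PENDING JUDGMENT")
ifelse(grepl("^Responsible", d), "Responsible", "Not Responsible or Pending")
# -> "Responsible" "Not Responsible or Pending" "Not Responsible or Pending"
```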
There are about 80k unique lat/longs, and some of them generate a huge number of violations. Here are the top 30 lat/longs:
top30viols <- viol_list %>%
select(lat, long) %>%
mutate(geocord = paste(lat,long)) %>%
group_by(geocord) %>%
summarize(lat=last(lat), long=last(long), num_viols_in_geo = n()) %>%
arrange(desc(num_viols_in_geo)) %>%
head(30)
top30viols
Plotting them, we see that the top one has 21k violations and sits in the center of Detroit, probably a default lat/long used when the actual coordinate is not available. I’m not sure about the others with over 1000 violations…
p <- gmap(lat = 42.37, lng = -83.10, zoom = 11, width = 600, height = 350,
map_style = gmap_style("apple_mapsesque")) %>%
ly_points(long, lat, data = top30viols, hover = num_viols_in_geo,
col = 'red', alpha = pmin(num_viols_in_geo / 1000, 1)) %>%
x_axis(visible = FALSE) %>%
y_axis(visible = FALSE)
p
Is the number of unique addresses the same?
viol_list %>%
select(address_only) %>%
group_by(address_only) %>%
summarize(num_viols_in_address = n()) %>%
arrange(desc(num_viols_in_address))
Well, it seems that there are about 110k unique addresses but only 73k unique lat/longs. Let’s see how they contrast in terms of number of violations (lat/long vs. address):
viol_list_cleaned <- viol_list %>%
mutate(geocoord = paste(lat,long)) %>%
group_by(geocoord) %>%
mutate(num_viols_in_geocoord = n())
viol_list_cleaned %<>%
group_by(address_only) %>%
mutate(num_viols_in_address = n())
nrow(viol_list_cleaned)
[1] 307804
About 70k entries show a different violation count when grouped by address vs. by lat/long, with about 35k unique records duplicated, so we get rid of them:
viol_list_cleaned %<>%
filter(num_viols_in_address == num_viols_in_geocoord) %>%
group_by(ViolationCode, address_only) %>%
mutate(num_viols_by_vcode = n()) %>%
arrange(desc(num_viols_in_geocoord)) %>%
ungroup() %>%
unique()
nrow(viol_list_cleaned)
[1] 220116
After cleaning, here are the top violation counts per geocoord/address:
viol_list_cleaned %>%
select(address_only, lat, long) %>%
mutate(geocord = paste(lat,long)) %>%
group_by(geocord) %>%
summarize(address=last(address_only), lat=last(lat), long=last(long), num_viols_in_geo = n()) %>%
arrange(desc(num_viols_in_geo)) %>%
select(-geocord) %>%
head(30)
There are 313 violation codes, some much more frequent than others:
violCodes <- viol_list_cleaned %>%
# mutate(ViolationCode = sub("^([0-9]+-[0-9]+)-.*$","\\1",ViolationCode)) %>%
group_by(ViolationCode) %>%
tally(sort=TRUE)
violCodes
Categorizing them semantically by hand reduces them to 12 groups; n is the number of codes mapped to each group.
violCodes_manual_categorization <-
read_csv("violCodes_manual_categorization.csv")
Parsed with column specification:
cols(
ViolGroup = col_character(),
ViolationCode = col_character(),
ViolDescription = col_character()
)
violCodes_manual_categorization %<>%
mutate(ViolGroup=as.factor(ViolGroup),
ViolationCode=as.factor(ViolationCode))
violCodes_manual_categorization %>% group_by(ViolGroup) %>% tally(sort = TRUE)
We add the grouping factor for violation categories, ViolGroup:
viol_list_cleaned %<>%
left_join(violCodes_manual_categorization,by="ViolationCode")
joining factors with different levels, coercing to character vector
#Some codes had no matching group; to prevent NAs, assign them to "other"
viol_list_cleaned[which(is.na(viol_list_cleaned$ViolGroup)),]$ViolGroup <- "other"
head(viol_list_cleaned)
Expand ViolGroup counts as separate features
violcodes_counts <- viol_list_cleaned %>%
select(address=address_only, lat, long, ViolGroup, Disposition,
JudgmentAmt, PaymentStatus) %>%
group_by(address, ViolGroup) %>%
summarize(num_viol_by_code = n()) %>%
ungroup() %>%
spread(ViolGroup, num_viol_by_code, fill = 0)
head(violcodes_counts)
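The counts-to-columns expansion performed by spread() can be illustrated with base R’s table() on toy data (hypothetical addresses and groups):

```r
toy <- data.frame(address   = c("1 Main St", "1 Main St", "2 Elm St"),
                  ViolGroup = c("trash", "trash", "maintenance"))
# one row per address, one count column per ViolGroup, zeros filled in
as.data.frame.matrix(table(toy$address, toy$ViolGroup))
#           maintenance trash
# 1 Main St           0     2
# 2 Elm St            1     0
```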
Getting a list of buildings. The grouping shouldn’t be necessary, since there is a one-to-one lat/long-to-address correspondence, but it’s left in just in case.
bld_list_viol <- viol_list_cleaned %>%
select(address=address_only, lat, long, ViolationCode, Disposition,
JudgmentAmt, PaymentStatus) %>%
filter(long > -83.3 & long < -82.8) %>%
filter(lat > 42.2 & lat < 42.5) %>%
mutate(Disposition = ifelse(grepl("^Responsible", Disposition),"Responsible","Not Responsible or Pending")) %>%
group_by(address) %>%
mutate(sdlat=sd(lat), sdlong=sd(long)) %>%
filter((sdlat<10e-4 & sdlong<10e-4) | (is.na(sdlat) | is.na(sdlong))) %>%
group_by(address, Disposition) %>%
mutate(num_disposition = n()) %>%
group_by(address) %>%
mutate(num_viols = n(), max_amt = max(JudgmentAmt)) %>%
filter(Disposition == "Responsible") %>%
summarise(lat=median(lat), long=median(long), num_viols = last(num_viols),
num_responsible = last(num_disposition), max_amt =last(max_amt)) %>%
unique()
head(bld_list_viol)
Adding the violation code counts
bld_list_viol %<>%
left_join(violcodes_counts, by="address")
colnames(bld_list_viol) <- make.names(colnames(bld_list_viol))
head(bld_list_viol)
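make.names() is used here (and after the later joins) because spread() creates columns named after factor levels, which are often not syntactic R names; for example:

```r
make.names(c("WRECK & REMOVE DEBRIS", "One Family Dwelling"))
# -> "WRECK...REMOVE.DEBRIS" "One.Family.Dwelling"
```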
Exploring 311 data
#converting to factors
cols <- c("issue_type", "ticket_status")
data_311 %<>% mutate_each_(funs(factor(.)),cols)
#converting to dates
cols <-c("ticket_closed_date_time", "acknowledged_at", "ticket_created_date_time",
"ticket_last_updated_date_time")
data_311 %<>% mutate_each_(funs(parse_date_time(.,orders="mdY HMS Op",tz="America/Detroit")),cols)
#dplyr::glimpse(data_311)
summary(data_311)
ticket_id city issue_type
Min. :1184398 Length:19680 Illegal Dumping / Illegal Dump Sites:3584
1st Qu.:1591936 Class :character Tree Issue :3546
Median :1705228 Mode :character Running Water in a Home or Building :2655
Mean :1699224 Clogged Drain :2490
3rd Qu.:1838305 Potholes :2399
Max. :1975499 Traffic Sign Issue :1030
(Other) :3976
ticket_status issue_description rating ticket_closed_date_time
Acknowledged:3012 Length:19680 Min. : 1.000 Min. :2014-07-14 16:24:49
Archived :9600 Class :character 1st Qu.: 2.000 1st Qu.:2015-04-14 13:14:10
Closed :7009 Mode :character Median : 3.000 Median :2015-06-15 13:52:22
Open : 59 Mean : 2.693 Mean :2015-06-01 22:16:27
3rd Qu.: 3.000 3rd Qu.:2015-08-13 15:31:53
Max. :19.000 Max. :2015-10-15 02:54:02
NA's :3175
acknowledged_at ticket_created_date_time ticket_last_updated_date_time
Min. :2014-07-14 14:05:07 Min. :2014-07-14 12:54:47 Min. :2014-07-14 17:06:47
1st Qu.:2015-04-16 12:46:43 1st Qu.:2015-04-14 15:37:22 1st Qu.:2015-04-21 09:06:55
Median :2015-06-15 12:58:42 Median :2015-06-10 13:55:45 Median :2015-06-22 11:44:41
Mean :2015-06-05 01:45:13 Mean :2015-06-01 17:55:42 Mean :2015-06-10 10:54:26
3rd Qu.:2015-08-11 13:51:43 3rd Qu.:2015-08-10 12:07:14 3rd Qu.:2015-08-21 09:12:26
Max. :2015-10-14 20:43:23 Max. :2015-10-15 00:32:16 Max. :2015-10-15 02:54:02
NA's :2023
address lat lng location image
Length:19680 Min. :41.88 Min. :-86.55 Length:19680 Length:19680
Class :character 1st Qu.:42.36 1st Qu.:-83.19 Class :character Class :character
Mode :character Median :42.39 Median :-83.11 Mode :character Mode :character
Mean :42.39 Mean :-83.11
3rd Qu.:42.42 3rd Qu.:-83.04
Max. :42.45 Max. :-82.91
There are 23 types of 311 issues; we could extract counts for each:
types311 <- data_311 %>%
group_by(issue_type) %>%
summarise(num_311type = n()) %>%
arrange(desc(num_311type))
types311
Generating features for each issue type as a count per address
type311counts <- data_311 %>%
select(address, issue_type) %>%
group_by(address, issue_type) %>%
summarize(num_type311 = n()) %>%
ungroup() %>%
spread(issue_type, num_type311, fill = 0)
head(type311counts)
Most tickets are either archived or closed, so ticket status is probably not very discriminative:
data_311 %>%
group_by(ticket_status) %>%
summarise(num_tx_status = n()) %>%
arrange(desc(num_tx_status))
Rating could be interesting to use
data_311 %>%
group_by(rating) %>%
summarise(num_rating = n()) %>%
arrange(desc(num_rating))
All entries are for the city of Detroit, which is good, but this won’t be a discriminating feature:
data_311 %>%
group_by(city) %>%
summarise(num_city = n()) %>%
arrange(desc(num_city))
Grouping by address shows multiple lat/longs per address; removing entries whose lat/long standard deviation is larger than 10e-4 degrees (~110 m) drops 62 entries.
bld_list_311 <- data_311 %>%
select(address, lat, long=lng, rating, type311 = issue_type, rating) %>%
filter(long > -83.3 & long < -82.8) %>%
filter(lat > 42.2 & lat < 42.5) %>%
group_by(address) %>%
mutate(sdlat=sd(lat), sdlong=sd(long)) %>%
filter((sdlat<10e-4 & sdlong<10e-4) | (is.na(sdlat) | is.na(sdlong))) %>%
summarize(lat=median(lat), long=median(long), num_311 = n(), max_rating = max(rating), min_rating=min(rating), diff_rating = max(rating) - min(rating)) %>%
unique()
head(bld_list_311)
Most 311 entries don’t change the rating
bld_list_311 %>%
group_by(diff_rating) %>%
tally(sort=TRUE)
Adding the 311 issue type counts
bld_list_311 %<>%
left_join(type311counts, by="address")
colnames(bld_list_311) <- make.names(colnames(bld_list_311))
head(bld_list_311)
Exploring criminal incidents in Detroit
head(data_crime)
#converting to factors
cols <- c("CATEGORY","STATEOFFENSEFILECLASS","PRECINCT","COUNCIL","NEIGHBORHOOD")
data_crime %<>% mutate_each_(funs(factor(.)),cols)
#converting to dates
cols <-c("INCIDENTDATE")
data_crime %<>% mutate_each_(funs(parse_date_time(.,orders="mdY HMS Op",tz="America/Detroit")),cols)
#dplyr::glimpse(data_crime)
summary(data_crime)
ROWNUM CASEID INCINO
Min. : 1 Min. :1612272 Min. :0.000e+00
1st Qu.: 29984 1st Qu.:1930308 1st Qu.:1.504e+09
Median : 59966 Median :1960745 Median :1.506e+09
Mean : 59966 Mean :1960655 Mean :1.506e+09
3rd Qu.: 89948 3rd Qu.:1991078 3rd Qu.:1.509e+09
Max. :119931 Max. :2021621 Max. :1.509e+10
NA's :2
CATEGORY OFFENSEDESCRIPTION STATEOFFENSEFILECLASS
TRAFFIC VIOLATIONS-MOTORCYCLE VIOLATIONS:29412 Length:119931 99009 :17383
ASSAULT :16471 Class :character 13001 :11140
LARCENY :14138 Mode :character 29000 : 9418
DAMAGE TO PROPERTY : 9418 13002 : 8283
AGGRAVATED ASSAULT : 8283 24001 : 6961
BURGLARY : 7764 (Other):66738
(Other) :34445 NA's : 8
INCIDENTDATE HOUR SCA PRECINCT
Min. :2015-01-01 00:00:00 Min. : 0.00 Min. : 201.0 8 :14076
1st Qu.:2015-03-31 00:00:00 1st Qu.: 8.00 1st Qu.: 409.0 3 :13913
Median :2015-06-16 00:00:00 Median :14.00 Median : 711.0 9 :13318
Mean :2015-06-14 10:37:57 Mean :12.91 Mean : 765.1 12 :12874
3rd Qu.:2015-08-30 00:00:00 3rd Qu.:18.00 3rd Qu.:1002.0 6 :11940
Max. :2015-12-10 00:00:00 Max. :23.00 Max. :9999.0 (Other):53039
NA's :771 NA's : 771
COUNCIL NEIGHBORHOOD CENSUSTRACT
City Council District 6:18330 GREENFIELD : 3944 Min. : 9
City Council District 7:18212 STATE FAIR-NOLAN: 3942 1st Qu.: 5114
City Council District 2:16619 WARRENDALE : 3582 Median : 5263
City Council District 3:16287 DENBY : 2955 Mean : 37989
City Council District 5:16230 PEMBROKE : 2888 3rd Qu.: 5393
(Other) :30131 (Other) :97961 Max. :99992016
NA's : 4122 NA's : 4659 NA's :13064
ADDRESS LON LAT LOCATION
Length:119931 Min. : -121.0 Min. : 0.0 Length:119931
Class :character 1st Qu.: -83.2 1st Qu.: 42.4 Class :character
Mode :character Median : -83.1 Median : 42.4 Mode :character
Mean : 2236.2 Mean : 2361.4
3rd Qu.: -83.0 3rd Qu.: 42.4
Max. :999999.0 Max. :999999.0
NA's :59 NA's :59
There are 50 categories of crime, which could be added as count features:
crime_categories <- data_crime %>%
group_by(CATEGORY) %>%
tally(sort=TRUE)
crime_categories
Crime seems spread fairly evenly across council districts, with districts 6 and 7 slightly higher:
data_crime %>%
group_by(COUNCIL) %>%
tally(sort=TRUE)
Grouping by address shows multiple lat/longs per address, and about 1000 entries have a lat/long standard deviation larger than 10e-4 degrees (~110 m), so I removed those:
bld_list_crime <- data_crime %>%
select(address=ADDRESS, lat=LAT, long=LON, typecrime = CATEGORY) %>%
filter(long > -83.3 & long < -82.8) %>%
filter(lat > 42.2 & lat < 42.5) %>%
group_by(address) %>%
mutate(sdlat=sd(lat), sdlong=sd(long)) %>%
filter((sdlat<10e-4 & sdlong<10e-4) | (is.na(sdlat) | is.na(sdlong))) %>%
summarize(lat=median(lat), long=median(long), num_crime = n()) %>%
unique()
head(bld_list_crime)
Generating crime features as counts per address for each crime category:
crime_category_counts <- data_crime %>%
select(address=ADDRESS, CATEGORY) %>%
group_by(address, CATEGORY) %>%
summarize(num_category = n()) %>%
ungroup() %>%
spread(CATEGORY, num_category, fill = 0)
head(crime_category_counts)
Adding the crime category counts
bld_list_crime %<>%
left_join(crime_category_counts, by="address")
colnames(bld_list_crime) <- make.names(colnames(bld_list_crime))
head(bld_list_crime)
Prepare a list of coordinates by type of record, at a precision of 4 decimal digits (~11 m accuracy). REF: https://en.wikipedia.org/wiki/Decimal_degrees#Precision
bld_list_coord <- bld_list_311 %>%
mutate(type = "311") %>%
select(type, lat, lon=long, address) %>%
bind_rows(select(mutate(bld_list_crime, type="crime"), type, lat, lon=long, address)) %>%
bind_rows(select(mutate(bld_list_viol, type="viol"), type, lat, lon=long, address)) %>%
mutate(coord=paste(lat,lon,sep=",")) %>%
arrange(desc(lat), desc(lon))
bld_list_coord
Let’s plot all records of 311 calls, blight violations, and crime reports on a map centered on Detroit, with records clustered by distance. Zooming in, lat/long coordinates about ±0.0002 degrees apart appear to belong to the same location.
library(leaflet)
iconurl <- "https://docs.google.com/drawings/d/11vcxQDH5DQstHFuUf0VhoXoWFx5b8elfZseXPLcbsmE/pub?w=50&h=50"
leaflet() %>%
setView(lat = 42.37, lng = -83.10, zoom = 11) %>%
addTiles(group = "OSM (default)") %>%
addMarkers(data = filter(bld_list_coord,type=="311"), lng = ~lon, lat = ~lat,
clusterOptions = markerClusterOptions(), group = "311 calls",
popup = ~coord, label = ~type) %>%
addMarkers(data = filter(bld_list_coord,type=="crime"), lng = ~lon, lat = ~lat,
clusterOptions = markerClusterOptions(), group = "crime",
popup = ~coord, label = ~type) %>%
addMarkers(data = filter(bld_list_coord,type=="viol"), lng = ~lon, lat = ~lat,
clusterOptions = markerClusterOptions(), group = "blight viol",
popup = ~coord, label = ~type) %>%
addMarkers(data = bld_list_permit, lng = ~long, lat = ~lat,
clusterOptions = markerClusterOptions(), group = "demolition",
popup = ~address, label = "demolition",
icon = list(iconUrl = iconurl, iconsize = c(1,1))) %>%
addLayersControl(
overlayGroups = c("311 calls", "crime", "blight viol","demolition"),
options = layersControlOptions(collapsed = FALSE)
)
Let’s build a list of potential locations to explore by grouping all lat/long coordinates that lie within about ±0.0002 degrees of each other. This approach yields 60679 buildings out of 146892 records.
# Allowable error
e_lat = 0.0002
e_lon = 0.0002
bld_list_all_filename <- "extracted_bld_list_elat2-elon2.csv"
# Calculation takes ~30 min, skip it if file above exists to speed up
if( file.exists(bld_list_all_filename) ) {
bld_list_all <- read_csv(bld_list_all_filename)
# file doesn't exist, let's crank the calculation up!
} else {
i <- 0
out <- data_frame()
input <- bld_list_coord %>%
mutate(coord_assigned = 0)
l <- nrow(input)
while(l > 0) {
input %<>%
filter(coord_assigned != 1)
lt <- input[1,]$lat
ln <- input[1,]$lon
input %<>%
mutate(coord_assigned = ifelse( (abs(lon-ln)<=e_lon & abs(lat-lt)<=e_lat), 1, 0) )
i <- i+1
out <- input %>%
filter(coord_assigned == 1) %>%
mutate(bld_id = i, bld_lat = median(lat), bld_lon = median(lon)) %>%
bind_rows(out)
l <- nrow(input)
if(l == 0) break
if(i%%100==0) print(paste0(i," iterations ",l," records left"))
}
bld_list_all <- out
out <- NULL
write_csv(bld_list_all, path = bld_list_all_filename)
}
Parsed with column specification:
cols(
type = col_character(),
lat = col_double(),
lon = col_double(),
address = col_character(),
coord = col_character(),
coord_assigned = col_double(),
bld_id = col_double(),
bld_lat = col_double(),
bld_lon = col_double()
)
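The greedy pass above can be condensed into a toy base-R illustration (hypothetical points; the first two fall within ±0.0002° of each other and collapse into one building):

```r
pts <- data.frame(lat = c(42.10000, 42.10010, 42.30000),
                  lon = c(-83.10000, -83.10015, -83.20000))
e <- 0.0002
bld_id <- rep(NA_integer_, nrow(pts))
i <- 0L
while (anyNA(bld_id)) {
  i <- i + 1L
  seed <- which(is.na(bld_id))[1]            # next unassigned point
  near <- is.na(bld_id) &
    abs(pts$lat - pts$lat[seed]) <= e &
    abs(pts$lon - pts$lon[seed]) <= e
  bld_id[near] <- i                          # everything within the box joins it
}
bld_id  # -> 1 1 2
```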
Let’s join the building identifier into the existing records:
bld_list_crime <- bld_list_all %>%
filter(type == "crime") %>%
select(bld_id, bld_lat, bld_lon, address) %>%
left_join(bld_list_crime, by="address") %>%
select(-lat, -long) %>%
rename(lat=bld_lat, long=bld_lon)
bld_list_311 <- bld_list_all %>%
filter(type == "311") %>%
select(bld_id, bld_lat, bld_lon, address) %>%
left_join(bld_list_311, by="address") %>%
select(-lat, -long) %>%
rename(lat=bld_lat, long=bld_lon)
bld_list_viol <- bld_list_all %>%
filter(type == "viol") %>%
select(bld_id, bld_lat, bld_lon, address) %>%
left_join(bld_list_viol, by="address") %>%
select(-lat, -long) %>%
rename(lat=bld_lat, long=bld_lon)
Recalculate summaries at the building level for crime
tmp <- bld_list_crime %>%
select(-address) %>%
select(-lat, -long) %>%
group_by(bld_id) %>%
summarise_each(funs(sum))
bld_list_crime %<>%
select(bld_id, lat, long) %>%
distinct() %>%
left_join(tmp, by="bld_id")
head(bld_list_crime)
Recalculate summaries at the building level for 311
tmp <- bld_list_311 %>%
select(-address) %>%
select(-lat, -long, -min_rating, -max_rating, -diff_rating) %>%
group_by(bld_id) %>%
summarise_each(funs(sum))
tmp <- bld_list_311 %>%
select(bld_id, min_rating, max_rating) %>%
group_by(bld_id) %>%
summarise(min_rating = min(min_rating), max_rating = max(max_rating)) %>%
mutate(diff_rating = max_rating - min_rating) %>%
left_join(tmp, by="bld_id")
bld_list_311 %<>%
select(bld_id, lat, long) %>%
distinct() %>%
left_join(tmp, by="bld_id")
head(bld_list_311)
Recalculate summaries at the building level for blight violations
tmp <- bld_list_viol %>%
select(-address) %>%
select(-lat, -long, -max_amt) %>%
group_by(bld_id) %>%
summarise_each(funs(sum))
tmp <- bld_list_viol %>%
select(bld_id, max_amt) %>%
group_by(bld_id) %>%
summarise(max_amt = max(max_amt)) %>%
left_join(tmp, by="bld_id")
bld_list_viol %<>%
select(bld_id, lat, long) %>%
distinct() %>%
left_join(tmp, by="bld_id")
head(bld_list_viol)
Direct join by building identifier, bld_id
bld_list_crime_311_viol <- bld_list_crime %>%
full_join(bld_list_311, by = "bld_id") %>%
full_join(bld_list_viol, by = "bld_id")
bld_list_crime_311_viol %<>%
mutate(lat = ifelse(is.na(lat),ifelse(is.na(lat.x),lat.y,lat.x),lat)) %>%
mutate(long = ifelse(is.na(long),ifelse(is.na(long.x),long.y,long.x),long)) %>%
select(-lat.x, -long.x, -lat.y, -long.y) %>%
select(bld_id, lat, long, everything())
head(bld_list_crime_311_viol)
Adding surrounding statistics by rounding coordinates to 3 decimals (0.001° ≈ 111 m):
bld_list_crime_311_viol %<>%
mutate(coord_neigh = paste(round(lat,digits=3),round(long,digits=3))) %>%
group_by(coord_neigh) %>%
mutate(
num_crime_neigh = sum(num_crime),
num_311_neigh = sum(num_311),
avg_max_rating_neigh = mean(max_rating),
avg_min_rating_neigh = mean(min_rating),
num_viols_neigh = sum(num_viols),
num_respons_neigh = sum(num_responsible),
avg_max_amt_neigh = mean(max_amt),
total_max_amt_neigh = sum(max_amt)
) %>%
ungroup()
head(bld_list_crime_311_viol)
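The rounding trick in isolation: points in the same 0.001° cell share a coord_neigh key (toy values; note that points just across a cell boundary would get different keys):

```r
lat  <- c(42.39412, 42.39438, 42.40110)
long <- c(-83.10031, -83.10042, -83.11100)
coord_neigh <- paste(round(lat, 3), round(long, 3))
coord_neigh
# -> "42.394 -83.1" "42.394 -83.1" "42.401 -83.111"
```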
Taking care of NAs with some dummy imputation:
codesviol <- make.names(unlist(lapply(unique(violCodes_manual_categorization$ViolGroup), as.character)))
codes311 <- make.names(unlist(lapply(types311$issue_type, as.character)))
codescrime <- make.names(unlist(lapply(crime_categories$CATEGORY, as.character)))
cols <- c("num_crime", "num_311", "num_viols", "num_responsible",
"num_crime_neigh", "num_311_neigh", "num_viols_neigh",
"num_respons_neigh", "total_max_amt_neigh")
cols <- c(cols, codesviol, codes311, codescrime)
bld_list_crime_311_viol %<>%
mutate_each_(funs(ifelse(is.na(.),0,.)),cols)
cols <- c("max_rating", "min_rating", "diff_rating", "max_amt",
"avg_max_rating_neigh", "avg_min_rating_neigh", "avg_max_amt_neigh")
bld_list_crime_311_viol %<>% mutate_each_(funs(ifelse(is.na(.),-1,.)),cols)
#Let's print the columns that still contain NA's, if character(0) then we are good!
out <- colnames(bld_list_crime_311_viol)[colSums(is.na(bld_list_crime_311_viol)) > 0]
ifelse(length(out)==0,print("No NA is left :)"),out)
[1] "No NA is left :)"
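A note on the mutate_each_()/funs() idiom used throughout: it has since been superseded in dplyr by across(). A sketch of the same 0/-1 imputation on toy columns:

```r
library(dplyr)

df <- tibble(num_crime = c(1, NA), max_rating = c(NA, 4))

# Modern dplyr: across() replaces the deprecated mutate_each_()/funs()
df <- df %>%
  mutate(across(num_crime, ~ ifelse(is.na(.x), 0, .x)),
         across(max_rating, ~ ifelse(is.na(.x), -1, .x)))
df
```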
The mashup results in 4,210 blighted entries out of 82,163 total buildings.
modeling_data <- bld_list_crime_311_viol
modeling_data %<>%
rowwise() %>%
mutate(nblighted = in_blight(lat, long, indata_degrees))
modeling_data %>%
filter(nblighted > 0) %>%
nrow()
[1] 4210
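in_blight() is a helper defined earlier in the analysis; as a rough illustration only (the function body, the eps radius, and the blight_coords structure below are all assumptions, not the author's actual implementation), it could amount to counting blight incidents within a small coordinate box:

```r
# Hypothetical sketch: count blight incidents within `eps` degrees of a point.
# `blight_coords` is an assumed data frame of incident lat/long coordinates.
in_blight <- function(lat, long, blight_coords, eps = 0.0001) {
  sum(abs(blight_coords$lat - lat) <= eps &
      abs(blight_coords$long - long) <= eps)
}

blight_coords <- data.frame(lat  = c(42.3314, 42.40),
                            long = c(-83.0458, -83.10))
in_blight(42.3314, -83.0458, blight_coords)  # matches the first incident only
```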
Creating the dependent variable “condition” as a boolean derived from “nblighted” (the number of blighted records at the same lat/long)
modeling_data %<>%
mutate(condition = factor(if_else(nblighted > 0, "BLIGHTED", "NOT_BLIGHTED"))) %>%
select(-bld_id, -lat, -long, -nblighted, -coord_neigh)
table(modeling_data$condition)
NOT_BLIGHTED BLIGHTED
77953 4210
Balancing the two classes by downsampling the non-blighted cases (not ideal, but it helps both modeling and compute time)
mod_data_sampled <- modeling_data %>%
filter(condition == "BLIGHTED")
mod_data_sampled <- modeling_data %>%
filter(condition == "NOT_BLIGHTED") %>%
sample_n(5500) %>%
bind_rows(mod_data_sampled)
table(mod_data_sampled$condition)
NOT_BLIGHTED BLIGHTED
5500 4210
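For reference, caret's downSample() does a similar balancing in one call, trimming the majority class down to the minority size (above we kept 5,500 rather than matching 4,210 exactly, which the manual approach allows). A sketch on toy data:

```r
library(caret)

set.seed(107)
toy <- data.frame(
  x = rnorm(120),
  condition = factor(rep(c("NOT_BLIGHTED", "BLIGHTED"), times = c(100, 20)))
)

# downSample() trims the majority class to match the minority class size
balanced <- downSample(x = toy["x"], y = toy$condition, yname = "condition")
table(balanced$condition)
```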
Creating a set-aside test set and a modeling set
set.seed(107)
inTest <- createDataPartition(y=mod_data_sampled$condition, p=0.1, list=FALSE)
testSet <- mod_data_sampled[inTest,]
modelSet <- mod_data_sampled[-inTest,]
table(testSet$condition)
NOT_BLIGHTED BLIGHTED
550 421
Creating training and validation data sets for model training and tuning; the training set is about 8.4k records
set.seed(107)
inValid <- createDataPartition(y=modelSet$condition, p=0.15, list=FALSE)
trainSet <- mod_data_sampled[-inValid,]
validSet <- mod_data_sampled[inValid,]
#we are actually using the validation set as a testset, so adding them together to
#have a better performance estimate
validSet %<>% bind_rows(testSet)
table(trainSet$condition)
NOT_BLIGHTED BLIGHTED
4661 3737
As for the validation set, it is about 2.3k records after merging in the test set
table(validSet$condition)
NOT_BLIGHTED BLIGHTED
1389 894
Plotting histograms of the number of violations for both blighted and non-blighted buildings, we see some differences that may help discriminate the two classes
p <- figure(width = 600, height = 350) %>%
ly_hist(num_viols, data = trainSet[which(trainSet$condition == "NOT_BLIGHTED"),], color = "blue", alpha = 0.25, freq = F, breaks = 25) %>%
ly_hist(num_viols, data = trainSet[which(trainSet$condition == "BLIGHTED"),], color = "red", alpha = 0.25, freq = F, breaks = 25)
p
Let’s train a generalized linear model (GLM) with 5-fold cross-validation, predicting the blight condition from the number of violations reported at the coordinate alone. This yields a 72.4% AUC on the ROC, with an 84.2% true positive rate but only a 37.7% true negative rate.
# 5 fold cross-validation, 1 repetition
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# generalized linear model
glmModel <- train(condition ~ num_viols,
data = trainSet,
method = "glm",
trControl = ctrl,
metric = "ROC"
)
glmModel
Generalized Linear Model
8398 samples
1 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6719, 6719, 6718, 6718, 6718
Resampling results:
ROC Sens Spec
0.7235278 0.8423085 0.3770386
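Sens and Spec above are the two class-wise rates from cross-validation; a balanced accuracy for this model is simply their mean, which makes the skew of this operating point obvious:

```r
sens <- 0.8423085  # class-wise rate for the first level, from the results above
spec <- 0.3770386  # class-wise rate for the second level
balanced_acc <- (sens + spec) / 2  # mean of the two rates
round(balanced_acc, 3)
```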
A decision tree has similar performance, 71.8% cross-validated AUC on the ROC, but at a more balanced operating point, with 65.0% true positive and 68.7% true negative rates.
# decision tree model
dtreeModel <- train(condition ~ num_viols,
data = trainSet,
method = "rpart",
trControl = ctrl,
metric = "ROC"
)
dtreeModel
CART
8398 samples
1 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6718, 6719, 6717, 6719, 6719
Resampling results across tuning parameters:
cp ROC Sens Spec
0.0001783962 0.7182296 0.6504980 0.6866661
0.0092320043 0.7025044 0.5736739 0.7712712
0.2389617340 0.5710988 0.7869099 0.3552878
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.0001783962.
A random forest yields slightly lower results, with a 71.0% AUC on the ROC, but also at a balanced operating point: 65.5% true positive and 68.0% true negative rates
# random forest model
rfModel <- train(condition ~ num_viols,
data = trainSet,
method = "rf",
trControl = ctrl,
metric = "ROC"
)
invalid mtry: reset to within valid range (warning repeated 6 times)
rfModel
Random Forest
8398 samples
1 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6719, 6718, 6719, 6718, 6718
Resampling results:
ROC Sens Spec
0.7103946 0.6550085 0.6802379
Tuning parameter 'mtry' was held constant at a value of 2
Comparing the models across cross-validation resamples, we see similar performance in terms of ROC. The GLM achieves the highest true positive rate, but at the cost of a low true negative rate.
resamps <- resamples(list(glm = glmModel, dtree = dtreeModel, rf = rfModel))
summary(resamps)
Other variables/features:
varcrime <- "num_crime"
var311 <- paste("min_rating", "max_rating", "diff_rating","num_311",sep = "+")
varviols <- paste("max_amt","num_viols","num_responsible",sep = "+")
varneigh <- paste("num_crime_neigh", "num_311_neigh","avg_max_rating_neigh",
"avg_min_rating_neigh", "num_viols_neigh", "num_respons_neigh",
"avg_max_amt_neigh", "total_max_amt_neigh",sep="+")
var311codes <- paste(codes311, collapse = "+")
varcrimecodes <- paste(codescrime, collapse = "+")
varviolcodes <- paste(codesviol, collapse = "+")
f_basic <- as.formula(paste("condition ~ ", paste(varcrime, var311, varviols, sep="+")))
f_basic_neigh <- as.formula(paste("condition ~ ", paste(varcrime, var311, varviols, varneigh, sep="+")))
f_all <- as.formula(paste("condition ~ ", paste(varcrime, var311, varviols, varneigh, var311codes, varcrimecodes, varviolcodes, sep="+")))
f_all_noneigh <- as.formula(paste("condition ~ ", paste(varcrime, var311, varviols, var311codes, varcrimecodes, varviolcodes, sep="+")))
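As an aside, base R's reformulate() builds the same kind of formulas from character vectors without manual paste()-ing; a sketch with a few of the variable names above:

```r
# reformulate() assembles `response ~ term1 + term2 + ...` from strings
vars <- c("num_crime", "num_311", "num_viols", "max_amt")
f <- reformulate(termlabels = vars, response = "condition")
f
# condition ~ num_crime + num_311 + num_viols + max_amt
```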
The similar random forest performance between the training and validation sets suggests that the model is not complex enough.
Adding 311-call and crime data yields a slightly better AUC of 72.0% (from 71.8%)
# decision tree model
dtreeModel2 <- train(f_basic,
data = trainSet,
method = "rpart",
trControl = ctrl,
metric = "ROC",
na.action=na.exclude
)
dtreeModel2
CART
8398 samples
8 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6718, 6719, 6718, 6718, 6719
Resampling results across tuning parameters:
cp ROC Sens Spec
0.004950495 0.7199713 0.5713362 0.8006339
0.011774150 0.7120725 0.5412974 0.8284786
0.246989564 0.6087019 0.6952790 0.5221249
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.004950495.
Adding the neighborhood features seems to drop the performance slightly to 71.4% AUC (from 72.0% above)
# decision tree model
dtreeModel3 <- train(f_basic_neigh,
data = trainSet,
method = "rpart",
trControl = ctrl,
metric = "ROC",
na.action=na.exclude
)
dtreeModel3
CART
8398 samples
16 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6719, 6718, 6717, 6719, 6719
Resampling results across tuning parameters:
cp ROC Sens Spec
0.003612523 0.7145145 0.6166142 0.7393521
0.009722594 0.6995196 0.5378796 0.8212572
0.246989564 0.5715752 0.7913130 0.3518373
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.003612523.
Adding all features slightly worsens the performance, to 70.8% AUC
# decision tree model
dtreeModel4 <- train(f_all,
data = trainSet,
method = "rpart",
trControl = ctrl,
metric = "ROC",
na.action=na.exclude
)
dtreeModel4
CART
8398 samples
101 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6718, 6719, 6717, 6719, 6719
Resampling results across tuning parameters:
cp ROC Sens Spec
0.005887075 0.7079370 0.6453859 0.6954399
0.009722594 0.7027594 0.6101928 0.7390811
0.246989564 0.6078667 0.6912017 0.5245316
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.005887075.
Removing the neighborhood features from the full set brings the performance back up slightly, to 71.8% AUC
# decision tree model
dtreeModel5 <- train(f_all_noneigh,
data = trainSet,
method = "rpart",
trControl = ctrl,
metric = "ROC",
na.action=na.exclude
)
dtreeModel5
CART
8398 samples
93 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6719, 6718, 6719, 6718, 6718
Resampling results across tuning parameters:
cp ROC Sens Spec
0.004950495 0.7185167 0.5970636 0.7588955
0.011774150 0.6973395 0.5434155 0.8166813
0.246989564 0.6092584 0.6929185 0.5255983
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.004950495.
Comparing the models:
resamps <- resamples(list(dtree_violonly = dtreeModel, dtree_plus311crime = dtreeModel2, dtree_plusneigh = dtreeModel3, dtree_pluscodes = dtreeModel4, dtree_all_minusneigh = dtreeModel5))
summary(resamps)
Call:
summary.resamples(object = resamps)
Models: dtree_violonly, dtree_plus311crime, dtree_plusneigh, dtree_pluscodes, dtree_all_minusneigh
Number of resamples: 5
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
dtree_violonly 0.7118 0.7136 0.7141 0.7182 0.7166 0.7351 0
dtree_plus311crime 0.7119 0.7158 0.7193 0.7200 0.7263 0.7265 0
dtree_plusneigh 0.7061 0.7079 0.7105 0.7145 0.7229 0.7252 0
dtree_pluscodes 0.7013 0.7068 0.7080 0.7079 0.7109 0.7127 0
dtree_all_minusneigh 0.7093 0.7170 0.7176 0.7185 0.7228 0.7259 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
dtree_violonly 0.5826 0.6524 0.6663 0.6505 0.6717 0.6795 0
dtree_plus311crime 0.5461 0.5644 0.5655 0.5713 0.5734 0.6073 0
dtree_plusneigh 0.5697 0.5777 0.5891 0.6166 0.6223 0.7242 0
dtree_pluscodes 0.5016 0.6674 0.6770 0.6454 0.6856 0.6953 0
dtree_all_minusneigh 0.5408 0.5687 0.5880 0.5971 0.6041 0.6838 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
dtree_violonly 0.6497 0.6524 0.6693 0.6867 0.6867 0.7751 0
dtree_plus311crime 0.7523 0.8046 0.8102 0.8006 0.8139 0.8222 0
dtree_plusneigh 0.6265 0.7166 0.7738 0.7394 0.7764 0.8035 0
dtree_pluscodes 0.6292 0.6310 0.6426 0.6954 0.6734 0.9011 0
dtree_all_minusneigh 0.6506 0.7206 0.7938 0.7589 0.8048 0.8246 0
Let’s try a random forest with all the variables. Being a more complex model, it reaches a new best performance of 73.2% AUC on the ROC (1.2 points better than the best decision tree), with a 61.8% true positive (TP) and a 73.9% true negative (TN) rate
# random forest model
rfModel2 <- train(condition ~ .,
data = trainSet,
method = "rf",
trControl = ctrl,
metric = "ROC"
)
rfModel2
Random Forest
8398 samples
101 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6718, 6718, 6719, 6719, 6718
Resampling results across tuning parameters:
mtry ROC Sens Spec
2 0.7316718 0.6179107 0.7385721
51 0.7069456 0.6865531 0.6237667
101 0.7037194 0.6891292 0.6202883
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
rf2varImp <- varImp(rfModel2, scale = FALSE)
plot(rf2varImp, top=28)
Retraining the random forest with only the most important features: using the top 17, the performance drops only slightly, to 72.9% AUC (from 73.2% with all 101 predictors)
trainSet2 <- trainSet %>% select(num_viols, max_amt, num_responsible, compliance, hazard, num_crime, waste, other, num_respons_neigh, total_max_amt_neigh, num_viols_neigh, avg_max_amt_neigh, diff_rating, num_crime_neigh, min_rating, num_311, max_rating, condition)
# random forest model
rfModel3 <- train(condition ~ .,
data = trainSet2,
method = "rf",
trControl = ctrl,
metric = "ROC"
)
We’ll also try gradient boosting. Let’s start by prepping the data for XGBoost
dtrain <- Matrix::sparse.model.matrix(condition ~ ., data=trainSet)
deval <- Matrix::sparse.model.matrix(condition ~ ., data=validSet)
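Note that sparse.model.matrix() drops the response from the formula, one-hot encodes any factor predictors via treatment contrasts, and adds an (Intercept) column; a tiny illustration:

```r
# sparse.model.matrix() expands factors into dummy columns, prepends an
# (Intercept) column, and drops the response on the formula's left side
df <- data.frame(condition = factor(c("A", "B")),
                 x = c(1.5, 2.5),
                 f = factor(c("low", "high")))
m <- Matrix::sparse.model.matrix(condition ~ ., data = df)
colnames(m)  # intercept, x, and one dummy for factor f
```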
Training and cross-validation of the GBM with XGBoost. Kudos to James Marquez for the XGBoost recipe; see www.jamesmarquezportfolio.com
grid <- expand.grid(nrounds = c(300, 400, 500, 600),
max_depth = c(3, 5, 7, 9),
eta = c(0.05, 0.1, 0.2, 0.3),
gamma = 1,
colsample_bytree = c(0.5, 1),
min_child_weight = 1,
subsample = 1)
ctrl <- trainControl(method = "cv",
number = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE,
allowParallel = TRUE )
set.seed(107)
system.time(xgbTune <- train(x = dtrain,
y = factor(trainSet$condition),
method = "xgbTree",
metric = "ROC",
tuneGrid = grid,
verbose = TRUE,
trControl = ctrl))
The training data could not be converted to a data frame for saving
user system elapsed
2084.519 19.757 565.640
Gradient boosting reaches the best cross-validation performance, a 73.5% AUC on the ROC, with 63.6% true positive and 71.2% true negative rates. The optimal hyperparameters are: nrounds = 300, max_depth = 3, eta = 0.05, gamma = 1, colsample_bytree = 1, min_child_weight = 1 and subsample = 1.
xgbTune
eXtreme Gradient Boosting
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 6718, 6718, 6718, 6719, 6719
Resampling results across tuning parameters:
eta max_depth colsample_bytree nrounds ROC Sens Spec
0.05 3 0.5 300 0.6189223 0.4677120 0.7618438
0.05 3 0.5 400 0.6189223 0.4677120 0.7618438
0.05 3 0.5 500 0.6189223 0.4677120 0.7618438
0.05 3 0.5 600 0.6189223 0.4677120 0.7618438
0.05 3 1.0 300 0.7346546 0.6363473 0.7120707
0.05 3 1.0 400 0.7344732 0.6372057 0.7118030
0.05 3 1.0 500 0.7344732 0.6372057 0.7118030
0.05 3 1.0 600 0.7344732 0.6372057 0.7118030
0.05 5 0.5 300 0.6174453 0.4644934 0.7637169
0.05 5 0.5 400 0.6174453 0.4644934 0.7637169
0.05 5 0.5 500 0.6174453 0.4644934 0.7637169
0.05 5 0.5 600 0.6174453 0.4644934 0.7637169
0.05 5 1.0 300 0.7336117 0.6412797 0.7053809
0.05 5 1.0 400 0.7335572 0.6414943 0.7051135
0.05 5 1.0 500 0.7335572 0.6414943 0.7051135
0.05 5 1.0 600 0.7335572 0.6414943 0.7051135
0.05 7 0.5 300 0.6160731 0.4593443 0.7669276
0.05 7 0.5 400 0.6160731 0.4593443 0.7669276
0.05 7 0.5 500 0.6160731 0.4593443 0.7669276
0.05 7 0.5 600 0.6160731 0.4593443 0.7669276
0.05 7 1.0 300 0.7332625 0.6404206 0.7024322
0.05 7 1.0 400 0.7332625 0.6404206 0.7024322
0.05 7 1.0 500 0.7332625 0.6404206 0.7024322
0.05 7 1.0 600 0.7332625 0.6404206 0.7024322
0.05 9 0.5 300 0.6156239 0.4580572 0.7679975
0.05 9 0.5 400 0.6156239 0.4580572 0.7679975
0.05 9 0.5 500 0.6156239 0.4580572 0.7679975
0.05 9 0.5 600 0.6156239 0.4580572 0.7679975
0.05 9 1.0 300 0.7294676 0.6472876 0.6893313
0.05 9 1.0 400 0.7294676 0.6472876 0.6893313
0.05 9 1.0 500 0.7294676 0.6472876 0.6893313
0.05 9 1.0 600 0.7294676 0.6472876 0.6893313
0.10 3 0.5 300 0.6185266 0.4672824 0.7618442
0.10 3 0.5 400 0.6185266 0.4672824 0.7618442
0.10 3 0.5 500 0.6185266 0.4672824 0.7618442
0.10 3 0.5 600 0.6185266 0.4672824 0.7618442
0.10 3 1.0 300 0.7337158 0.6397782 0.7112672
0.10 3 1.0 400 0.7337158 0.6397782 0.7112672
0.10 3 1.0 500 0.7337158 0.6397782 0.7112672
0.10 3 1.0 600 0.7337158 0.6397782 0.7112672
0.10 5 0.5 300 0.6188520 0.4638494 0.7642517
0.10 5 0.5 400 0.6188520 0.4638494 0.7642517
0.10 5 0.5 500 0.6188520 0.4638494 0.7642517
0.10 5 0.5 600 0.6188520 0.4638494 0.7642517
0.10 5 1.0 300 0.7327701 0.6372054 0.7088575
0.10 5 1.0 400 0.7327701 0.6372054 0.7088575
0.10 5 1.0 500 0.7327701 0.6372054 0.7088575
0.10 5 1.0 600 0.7327701 0.6372054 0.7088575
0.10 7 0.5 300 0.6176513 0.4604166 0.7677301
0.10 7 0.5 400 0.6176513 0.4604166 0.7677301
0.10 7 0.5 500 0.6176513 0.4604166 0.7677301
0.10 7 0.5 600 0.6176513 0.4604166 0.7677301
0.10 7 1.0 300 0.7315357 0.6436402 0.6930703
0.10 7 1.0 400 0.7315357 0.6436402 0.6930703
0.10 7 1.0 500 0.7315357 0.6436402 0.6930703
0.10 7 1.0 600 0.7315357 0.6436402 0.6930703
0.10 9 0.5 300 0.6149977 0.4567694 0.7682649
0.10 9 0.5 400 0.6149977 0.4567694 0.7682649
0.10 9 0.5 500 0.6149977 0.4567694 0.7682649
0.10 9 0.5 600 0.6149977 0.4567694 0.7682649
0.10 9 1.0 300 0.7282451 0.6436393 0.6863826
0.10 9 1.0 400 0.7282451 0.6436393 0.6863826
0.10 9 1.0 500 0.7282451 0.6436393 0.6863826
0.10 9 1.0 600 0.7282451 0.6436393 0.6863826
0.20 3 0.5 300 0.6183800 0.4664245 0.7629137
0.20 3 0.5 400 0.6183800 0.4664245 0.7629137
0.20 3 0.5 500 0.6183800 0.4664245 0.7629137
0.20 3 0.5 600 0.6183800 0.4664245 0.7629137
0.20 3 1.0 300 0.7336891 0.6427830 0.7051114
0.20 3 1.0 400 0.7336891 0.6427830 0.7051114
0.20 3 1.0 500 0.7336891 0.6427830 0.7051114
0.20 3 1.0 600 0.7336891 0.6427830 0.7051114
0.20 5 0.5 300 0.6180756 0.4610606 0.7666602
0.20 5 0.5 400 0.6180756 0.4610606 0.7666602
0.20 5 0.5 500 0.6180756 0.4610606 0.7666602
0.20 5 0.5 600 0.6180756 0.4610606 0.7666602
0.20 5 1.0 300 0.7318531 0.6436391 0.7016275
0.20 5 1.0 400 0.7318531 0.6436391 0.7016275
0.20 5 1.0 500 0.7318531 0.6436391 0.7016275
0.20 5 1.0 600 0.7318531 0.6436391 0.7016275
0.20 7 0.5 300 0.6154024 0.4591290 0.7674616
0.20 7 0.5 400 0.6154024 0.4591290 0.7674616
0.20 7 0.5 500 0.6154024 0.4591290 0.7674616
0.20 7 0.5 600 0.6154024 0.4591290 0.7674616
0.20 7 1.0 300 0.7292590 0.6479293 0.6815587
0.20 7 1.0 400 0.7292590 0.6479293 0.6815587
0.20 7 1.0 500 0.7292590 0.6479293 0.6815587
0.20 7 1.0 600 0.7292590 0.6479293 0.6815587
0.20 9 0.5 300 0.6145506 0.4574132 0.7690681
0.20 9 0.5 400 0.6145506 0.4574132 0.7690681
0.20 9 0.5 500 0.6145506 0.4574132 0.7690681
0.20 9 0.5 600 0.6145506 0.4574132 0.7690681
0.20 9 1.0 300 0.7261859 0.6578033 0.6687216
0.20 9 1.0 400 0.7261859 0.6578033 0.6687216
0.20 9 1.0 500 0.7261859 0.6578033 0.6687216
0.20 9 1.0 600 0.7261859 0.6578033 0.6687216
0.30 3 0.5 300 0.6178397 0.4632060 0.7631818
0.30 3 0.5 400 0.6178397 0.4632060 0.7631818
0.30 3 0.5 500 0.6178397 0.4632060 0.7631818
0.30 3 0.5 600 0.6178397 0.4632060 0.7631818
0.30 3 1.0 300 0.7308070 0.6423474 0.7013627
0.30 3 1.0 400 0.7308070 0.6423474 0.7013627
0.30 3 1.0 500 0.7308070 0.6423474 0.7013627
0.30 3 1.0 600 0.7308070 0.6423474 0.7013627
0.30 5 0.5 300 0.6157272 0.4589140 0.7671950
0.30 5 0.5 400 0.6157272 0.4589140 0.7671950
0.30 5 0.5 500 0.6157272 0.4589140 0.7671950
0.30 5 0.5 600 0.6157272 0.4589140 0.7671950
0.30 5 1.0 300 0.7300087 0.6472887 0.6917288
0.30 5 1.0 400 0.7300087 0.6472887 0.6917288
0.30 5 1.0 500 0.7300087 0.6472887 0.6917288
0.30 5 1.0 600 0.7300087 0.6472887 0.6917288
0.30 7 0.5 300 0.6137818 0.4567687 0.7688007
0.30 7 0.5 400 0.6137818 0.4567687 0.7688007
0.30 7 0.5 500 0.6137818 0.4567687 0.7688007
0.30 7 0.5 600 0.6137818 0.4567687 0.7688007
0.30 7 1.0 300 0.7249734 0.6590897 0.6716760
0.30 7 1.0 400 0.7249734 0.6590897 0.6716760
0.30 7 1.0 500 0.7249734 0.6590897 0.6716760
0.30 7 1.0 600 0.7249734 0.6590897 0.6716760
0.30 9 0.5 300 0.6126548 0.4518343 0.7704053
0.30 9 0.5 400 0.6126548 0.4518343 0.7704053
0.30 9 0.5 500 0.6126548 0.4518343 0.7704053
0.30 9 0.5 600 0.6126548 0.4518343 0.7704053
0.30 9 1.0 300 0.7226647 0.6584438 0.6542799
0.30 9 1.0 400 0.7226647 0.6584438 0.6542799
0.30 9 1.0 500 0.7226647 0.6584438 0.6542799
0.30 9 1.0 600 0.7226647 0.6584438 0.6542799
Tuning parameter 'gamma' was held constant at a value of 1
Tuning
parameter 'min_child_weight' was held constant at a value of 1
Tuning
parameter 'subsample' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 300, max_depth = 3, eta = 0.05, gamma
= 1, colsample_bytree = 1, min_child_weight = 1 and subsample = 1.
ggplot(xgbTune) +
theme(legend.position = "top")
Ignoring unknown aesthetics: shape
Training a GBM with the best parameters yields a training AUC of 78.7%
trainlab <- trainSet %>%
select(condition) %>%
mutate(lab = ifelse((condition=="BLIGHTED"),1,0)) %>%
select(lab)
param <- list(objective = 'binary:logistic',
eval_metric = 'auc',
max_depth = 3,
eta = 0.05,
gamma = 1,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1)
set.seed(107)
system.time(xgb <- xgboost(params = param,
data = dtrain,
label = trainlab$lab,
nrounds = 300,
print_every_n = 100,
verbose = 1))
[1] train-auc:0.728302
[101] train-auc:0.766306
[201] train-auc:0.778231
[300] train-auc:0.787431
user system elapsed
1.452 0.007 1.459
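Since most grid cells above plateau well before 300 rounds, xgb.cv() with early stopping is a cheaper way to pick nrounds; a sketch on synthetic data standing in for the blight features (all names here are illustrative):

```r
library(xgboost)

set.seed(107)
# Synthetic binary problem standing in for the blight data
X <- matrix(rnorm(2000), ncol = 4)
y <- as.numeric(X[, 1] + rnorm(500) > 0)

# Stop adding rounds once the CV AUC has not improved for 20 rounds
cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eval_metric = "auc",
                           max_depth = 3, eta = 0.05),
             data = xgb.DMatrix(X, label = y),
             nrounds = 300,
             nfold = 5,
             early_stopping_rounds = 20,
             verbose = 0)
cv$best_iteration
```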
Below is a plot of feature importance
model <- xgb.dump(xgb, with_stats = TRUE)
names <- dimnames(dtrain)[[2]]
importance_matrix <- xgb.importance(names, model = xgb)[1:30]
xgb.plot.importance(importance_matrix)
Plotting the ROC, we can explore different values of true positive and false positive rates.
validlab <- validSet %>%
select(condition) %>%
mutate(lab = ifelse((condition=="BLIGHTED"),1,0)) %>%
select(lab)
xgbVal <- predict(xgb, newdata = deval)
xgb.pred <- ROCR::prediction(xgbVal, validlab$lab)
xgb.perf <- ROCR::performance(xgb.pred, "tpr", "fpr")
auc <- ROCR::performance(xgb.pred,"auc")
auc <- unlist(slot(auc, "y.values"))
auc<-round(auc, digits = 3)
auct <- paste(c("AUC = "),auc,sep="")
plot(xgb.perf,
avg="threshold",
colorize=TRUE,
lwd=3,
print.cutoffs.at=seq(0, 1, by=0.05),
text.adj=c(-0.5, 0.5),
text.cex=0.6)
grid(col="lightgray")
axis(1, at=seq(0, 1, by=0.1))
axis(2, at=seq(0, 1, by=0.1))
abline(v=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
abline(h=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
lines(x=c(0, 1), y=c(0, 1), col="black", lty="dotted")
#adding min/max AUC in the plot
legend(0.5,0.5,c(auct),border="white",cex=1.2,box.col = "white")
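Rather than reading the operating point off the curve, it can be computed from the ROCR objects directly by maximizing Youden's J (TPR − FPR) over the cutoffs; a self-contained sketch on toy scores (the actual value in the text comes from the real validation scores):

```r
library(ROCR)

# Toy scores/labels standing in for xgbVal and validlab$lab
set.seed(107)
labels <- rbinom(200, 1, 0.4)
scores <- ifelse(labels == 1, rnorm(200, 0.6, 0.2), rnorm(200, 0.4, 0.2))

pred <- ROCR::prediction(scores, labels)
perf <- ROCR::performance(pred, "tpr", "fpr")

# Youden's J = TPR - FPR; the cutoff that maximizes it is the "optimal" point
tpr  <- perf@y.values[[1]]
fpr  <- perf@x.values[[1]]
cuts <- perf@alpha.values[[1]]
best <- which.max(tpr - fpr)
cuts[best]
```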
Let’s check the results on the validation set. The ROC above suggests an optimal operating point around a 0.47 cutoff; at the 0.5 threshold used below, we get a 71.1% true positive and a 65.7% true negative rate
xgbVal <- predict(xgb, newdata = deval)
xgbVal.resp <- ifelse(xgbVal > 0.5, 1, 0)
confusionMatrix(xgbVal.resp, validlab$lab, positive = '1')
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 913 258
1 476 636
Accuracy : 0.6785
95% CI : (0.6589, 0.6976)
No Information Rate : 0.6084
P-Value [Acc > NIR] : 2.192e-12
Kappa : 0.3534
Mcnemar's Test P-Value : 1.151e-15
Sensitivity : 0.7114
Specificity : 0.6573
Pos Pred Value : 0.5719
Neg Pred Value : 0.7797
Prevalence : 0.3916
Detection Rate : 0.2786
Detection Prevalence : 0.4871
Balanced Accuracy : 0.6844
'Positive' Class : 1
Other data to explore in the future: - Neighborhood/region - Demographic data - MLS / Zillow (last sold date, price, number of nearby houses sold) - Detroit Parcel data